Business Analytics

Project description

You've been offered an internship in the analytical department at Yandex.Afisha. Your first task is to help optimize marketing expenses. You have:

  • Server logs with data on Yandex.Afisha visits from June 2017 through May 2018
  • Dump file with all orders for the period
  • Marketing expenses statistics

You are going to study:

  • How people use the product
  • When they start to buy
  • How much money each customer brings
  • When they pay off

Step 1. Download the data and prepare it for analysis

In [1]:
import math
import pandas as pd
from scipy import stats as st
import numpy as np
import matplotlib.pyplot as plt
import warnings
import plotly.express as px
import seaborn as sns
import matplotlib.colors as mcolors
import matplotlib.dates as mdates
warnings.filterwarnings("ignore")

%matplotlib inline

Downloading the data sets

In [2]:
local_path1 = '/Users/yoni/Pictures/jupyter/Business Analytics/visits_log_us.csv'
local_path2 = '/Users/yoni/Pictures/jupyter/Business Analytics/orders_log_us.csv'
local_path3 = '/Users/yoni/Pictures/jupyter/Business Analytics/costs_us.csv'
In [3]:
visits = pd.read_csv(local_path1, 
                  dtype={'Device': 'category','Source Id':'category'},
                  parse_dates=['Start Ts', 'End Ts'])

orders = pd.read_csv(local_path2, 
                  parse_dates=['Buy Ts'])   

costs = pd.read_csv(local_path3, 
                  dtype={'source_id':'category'},
                  parse_dates=['dt'])
In [4]:
visits.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359400 entries, 0 to 359399
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Device     359400 non-null  category      
 1   End Ts     359400 non-null  datetime64[ns]
 2   Source Id  359400 non-null  category      
 3   Start Ts   359400 non-null  datetime64[ns]
 4   Uid        359400 non-null  uint64        
dtypes: category(2), datetime64[ns](2), uint64(1)
memory usage: 8.9 MB
In [5]:
orders.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50415 entries, 0 to 50414
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   Buy Ts   50415 non-null  datetime64[ns]
 1   Revenue  50415 non-null  float64       
 2   Uid      50415 non-null  uint64        
dtypes: datetime64[ns](1), float64(1), uint64(1)
memory usage: 1.2 MB
In [6]:
costs.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2542 entries, 0 to 2541
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   source_id  2542 non-null   category      
 1   dt         2542 non-null   datetime64[ns]
 2   costs      2542 non-null   float64       
dtypes: category(1), datetime64[ns](1), float64(1)
memory usage: 43.1 KB

Preparing the data and studying the general information

Visits

In [7]:
visits.columns=visits.columns.str.lower() #changing the column names to lowercase
In [8]:
visits.head(10)
Out[8]:
device end ts source id start ts uid
0 touch 2017-12-20 17:38:00 4 2017-12-20 17:20:00 16879256277535980062
1 desktop 2018-02-19 17:21:00 2 2018-02-19 16:53:00 104060357244891740
2 touch 2017-07-01 01:54:00 5 2017-07-01 01:54:00 7459035603376831527
3 desktop 2018-05-20 11:23:00 9 2018-05-20 10:59:00 16174680259334210214
4 desktop 2017-12-27 14:06:00 3 2017-12-27 14:06:00 9969694820036681168
5 desktop 2017-09-03 21:36:00 5 2017-09-03 21:35:00 16007536194108375387
6 desktop 2018-01-30 12:09:00 1 2018-01-30 11:13:00 6661610529277171451
7 touch 2017-11-05 15:15:00 3 2017-11-05 15:14:00 11423865690854540312
8 desktop 2017-07-19 10:44:00 3 2017-07-19 10:41:00 2987360259350925644
9 desktop 2017-11-08 13:43:00 5 2017-11-08 13:42:00 1289240080042562063
In [9]:
visits.tail(10)
Out[9]:
device end ts source id start ts uid
359390 desktop 2017-12-11 17:54:19 1 2017-12-11 17:54:00 18268369011883553090
359391 touch 2017-08-30 11:53:19 2 2017-08-30 11:53:00 18304232853749481866
359392 desktop 2017-12-30 21:49:19 4 2017-12-30 21:49:00 18311497973035795342
359393 desktop 2017-06-08 07:49:19 4 2017-06-08 07:49:00 18325675073358289850
359394 touch 2017-11-06 08:00:19 4 2017-11-06 08:00:00 18332905065174083046
359395 desktop 2017-07-29 19:07:19 2 2017-07-29 19:07:00 18363291481961487539
359396 touch 2018-01-25 17:38:19 1 2018-01-25 17:38:00 18370831553019119586
359397 desktop 2018-03-03 10:12:19 4 2018-03-03 10:12:00 18387297585500748294
359398 desktop 2017-11-02 10:12:19 5 2017-11-02 10:12:00 18388616944624776485
359399 touch 2017-09-10 13:13:19 2 2017-09-10 13:13:00 18396128934054549559
In [10]:
# Changing the names of the columns 
visits = visits.rename(columns={'end ts': 'end_ts','source id':'source_id','start ts':'start_ts'}) 
In [11]:
print('% Of values in Device\n\n{}'.format(visits.device.value_counts()/len(visits)))
% Of values in Device

desktop    0.73057
touch      0.26943
Name: device, dtype: float64
In [12]:
print('% Of values in Source Id\n\n{}'.format(visits['source_id'].value_counts()/len(visits)))
% Of values in Source Id

4     0.283233
3     0.238203
5     0.186157
2     0.132515
1     0.094939
9     0.036942
10    0.027894
7     0.000100
6     0.000017
Name: source_id, dtype: float64
In [13]:
visits[visits.duplicated(['end_ts','start_ts','uid'],keep=False)] #Looking for duplicates
Out[13]:
device end_ts source_id start_ts uid

In the 'visits' data frame there are no duplicates and no missing values. We can see that users logged in far more often from a desktop device (73%) than from a touch device (27%).

Orders

In [14]:
orders.columns=orders.columns.str.lower() #changing the column names to lowercase

orders = orders.rename(columns={'buy ts': 'buy_ts'}) # Changing the names of the columns 
In [15]:
orders.head(10)
Out[15]:
buy_ts revenue uid
0 2017-06-01 00:10:00 17.00 10329302124590727494
1 2017-06-01 00:25:00 0.55 11627257723692907447
2 2017-06-01 00:27:00 0.37 17903680561304213844
3 2017-06-01 00:29:00 0.55 16109239769442553005
4 2017-06-01 07:58:00 0.37 14200605875248379450
5 2017-06-01 08:43:00 0.18 10402394430196413321
6 2017-06-01 08:54:00 1.83 12464626743129688638
7 2017-06-01 09:22:00 1.22 3644482766749211722
8 2017-06-01 09:22:00 3.30 17542070709969841479
9 2017-06-01 09:23:00 0.37 1074355127080856382
In [16]:
orders.tail(10)
Out[16]:
buy_ts revenue uid
50405 2018-05-31 23:36:00 0.61 1551265230644458155
50406 2018-05-31 23:40:00 1.16 13121453305610619664
50407 2018-05-31 23:40:00 2.26 15576344609395725707
50408 2018-05-31 23:42:00 0.92 15854392578508828030
50409 2018-05-31 23:43:00 3.67 1805512004303848280
50410 2018-05-31 23:50:00 4.64 12296626599487328624
50411 2018-05-31 23:50:00 5.80 11369640365507475976
50412 2018-05-31 23:54:00 0.30 1786462140797698849
50413 2018-05-31 23:56:00 3.67 3993697860786194247
50414 2018-06-01 00:02:00 3.42 83872787173869366
In [17]:
orders.describe()
Out[17]:
revenue uid
count 50415.000000 5.041500e+04
mean 4.999647 9.098161e+18
std 21.818359 5.285742e+18
min 0.000000 3.135781e+14
25% 1.220000 4.533567e+18
50% 2.500000 9.102274e+18
75% 4.890000 1.368290e+19
max 2633.280000 1.844617e+19

We can see there are no missing values or duplicates in the 'orders' df.

The minimum revenue is 0, which we should look into: it should not be possible to buy something for free, so these rows most likely reflect logging errors or cancelled orders.

In [18]:
orders[orders.duplicated(['buy_ts','uid'],keep=False)]
Out[18]:
buy_ts revenue uid
In [19]:
len(orders.query('revenue == 0'))
Out[19]:
51

There are 51 rows with a revenue equal to 0; I will delete these rows.

In [20]:
orders_clean = orders[orders['revenue'] != 0]
orders_clean.reset_index(drop=True,inplace=True)
len(orders_clean.query('revenue == 0'))
Out[20]:
0
In [21]:
orders_clean.tail() # new df with no '0' in revenue
Out[21]:
buy_ts revenue uid
50359 2018-05-31 23:50:00 4.64 12296626599487328624
50360 2018-05-31 23:50:00 5.80 11369640365507475976
50361 2018-05-31 23:54:00 0.30 1786462140797698849
50362 2018-05-31 23:56:00 3.67 3993697860786194247
50363 2018-06-01 00:02:00 3.42 83872787173869366

Costs

In [22]:
costs.head(10)
Out[22]:
source_id dt costs
0 1 2017-06-01 75.20
1 1 2017-06-02 62.25
2 1 2017-06-03 36.53
3 1 2017-06-04 55.00
4 1 2017-06-05 57.08
5 1 2017-06-06 40.39
6 1 2017-06-07 40.59
7 1 2017-06-08 56.63
8 1 2017-06-09 40.16
9 1 2017-06-10 43.24
In [23]:
costs.tail(10)
Out[23]:
source_id dt costs
2532 10 2018-05-22 21.70
2533 10 2018-05-23 7.42
2534 10 2018-05-24 10.79
2535 10 2018-05-25 22.21
2536 10 2018-05-26 7.89
2537 10 2018-05-27 9.92
2538 10 2018-05-28 21.26
2539 10 2018-05-29 11.32
2540 10 2018-05-30 33.15
2541 10 2018-05-31 17.60
In [24]:
costs.describe()
Out[24]:
costs
count 2542.000000
mean 129.477427
std 156.296628
min 0.540000
25% 21.945000
50% 77.295000
75% 170.065000
max 1788.280000
In [25]:
costs[costs.duplicated(keep=False)]
Out[25]:
source_id dt costs

The 'costs' df looks good to go.

Step 2. Make reports and calculate metrics

Product

How many people use it every day, week, and month?

For this task I will calculate the DAU, WAU and MAU: the number of unique active users for each day, week, and month.

In [26]:
visits.head(5)
Out[26]:
device end_ts source_id start_ts uid
0 touch 2017-12-20 17:38:00 4 2017-12-20 17:20:00 16879256277535980062
1 desktop 2018-02-19 17:21:00 2 2018-02-19 16:53:00 104060357244891740
2 touch 2017-07-01 01:54:00 5 2017-07-01 01:54:00 7459035603376831527
3 desktop 2018-05-20 11:23:00 9 2018-05-20 10:59:00 16174680259334210214
4 desktop 2017-12-27 14:06:00 3 2017-12-27 14:06:00 9969694820036681168
In [27]:
visits['session_year']  = visits['start_ts'].dt.year
visits['session_month']  = visits['start_ts'].dt.month
visits['session_week']  = visits['start_ts'].dt.week
visits['session_date']  = visits['start_ts'].dt.date

visits['month_year'] = pd.to_datetime(visits['start_ts']).dt.to_period('M')
visits['session_date']=pd.to_datetime(visits['session_date'])
In [28]:
visits.head(5)
Out[28]:
device end_ts source_id start_ts uid session_year session_month session_week session_date month_year
0 touch 2017-12-20 17:38:00 4 2017-12-20 17:20:00 16879256277535980062 2017 12 51 2017-12-20 2017-12
1 desktop 2018-02-19 17:21:00 2 2018-02-19 16:53:00 104060357244891740 2018 2 8 2018-02-19 2018-02
2 touch 2017-07-01 01:54:00 5 2017-07-01 01:54:00 7459035603376831527 2017 7 26 2017-07-01 2017-07
3 desktop 2018-05-20 11:23:00 9 2018-05-20 10:59:00 16174680259334210214 2018 5 20 2018-05-20 2018-05
4 desktop 2017-12-27 14:06:00 3 2017-12-27 14:06:00 9969694820036681168 2017 12 52 2017-12-27 2017-12
In [29]:
# MAU
mau_total = visits.groupby(['month_year'],as_index=False).agg({'uid': 'nunique'})
mau_total
Out[29]:
month_year uid
0 2017-06 13259
1 2017-07 14183
2 2017-08 11631
3 2017-09 18975
4 2017-10 29692
5 2017-11 32797
6 2017-12 31557
7 2018-01 28716
8 2018-02 28749
9 2018-03 27473
10 2018-04 21008
11 2018-05 20701
In [30]:
# WAU
wau_total = visits.groupby(['session_year', 'session_week'],as_index=False).agg({'uid': 'nunique'})
wau_total.head()
Out[30]:
session_year session_week uid
0 2017 22 2021
1 2017 23 4129
2 2017 24 2812
3 2017 25 2878
4 2017 26 3064
In [31]:
# DAU
dau_total = visits.groupby(['session_date'],as_index=False).agg({'uid': 'nunique'})
dau_total.head()
Out[31]:
session_date uid
0 2017-06-01 605
1 2017-06-02 608
2 2017-06-03 445
3 2017-06-04 476
4 2017-06-05 820
In [32]:
plt.figure(figsize=(15,8))
ax=sns.barplot(x='month_year', y='uid', data=mau_total)
ax.set_title('Number Of Active Unique Users Each Month', fontdict={'size':15})

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
                   (p.get_x() + p.get_width() / 2.,
                    p.get_height()), ha = 'center', va = 'center',
                   xytext = (0, -12), textcoords = 'offset points',fontsize=14,color='w')

plt.xticks(rotation=30)
plt.xlabel("Month")
plt.ylabel("Number Of Sessions")



plt.show()
In [33]:
plt.figure(figsize=(15,8))
sns.lineplot(x='session_week', y='uid', data=wau_total,linewidth = 3).set_title('Number Of Active Unique Users Each Week'
                                                               , fontdict={'size':15})
plt.xlabel("Week")
x_ticks = np.arange(0, 55, 4)
plt.xticks(x_ticks)
plt.ylabel("Number Of Sessions")
plt.grid()
plt.show()
In [34]:
plt.figure(figsize=(15,8))
sns.lineplot(x='session_date', y='uid', data=dau_total,linewidth = 2).set_title('Number Of Active Unique Users Each Day'
                                                               , fontdict={'size':15})
plt.xlabel("date")

plt.ylabel("Number Of Sessions")

plt.grid()
plt.show()

These three graphs show essentially the same information at different resolutions.

The first graph shows unique users per month: on average we get more visitors as the holiday season approaches. The number of sessions drops after March and keeps falling until September, then rises again to a peak in November.

The second graph shows unique users per week. On the whole it tells a similar story, but with more fluctuation. Those swings could be explained by holidays, advertising campaigns, or even paycheck timing.

The third graph is the same by day, at even higher resolution. The story is similar again, but this time a clear zig-zag pattern appears, which I would guess comes from changes in people's behavior over the weekend. On top of that there are two very distinct extremes: a high peak around Christmas time, and a sharp dip that seems to fall at a random time. It would be worth looking into those few days in April.
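One way to test the weekend hypothesis is to average daily active users by day of week. This is a sketch on synthetic data (the real `dau_total` built above has the same `session_date`/`uid` shape, so only the synthesis part would be dropped):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range('2017-06-01', '2018-05-31', freq='D')
dau_demo = pd.DataFrame({
    'session_date': dates,
    # synthetic DAU: weekday traffic generated higher than weekend traffic
    'uid': np.where(dates.dayofweek < 5, 900, 500) + rng.integers(0, 100, len(dates)),
})

by_weekday = (dau_demo
              .assign(weekday=dau_demo['session_date'].dt.day_name())
              .groupby('weekday')['uid'].mean()
              .sort_values(ascending=False))
print(by_weekday)
```

If weekdays cluster well above Saturday and Sunday, the zig-zag is indeed a weekend effect.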

How many sessions are there per day?

In [35]:
daily_sessions = visits.groupby('session_date',as_index=False).agg({'uid':'count'})
In [36]:
avg_daily_ses = int(visits.groupby('session_date',as_index=False).agg({'uid':'count'}).mean())
print('Number of average daily sessions :{}'.format(avg_daily_ses))
Number of average daily sessions :987
In [37]:
daily_sessions.head()
Out[37]:
session_date uid
0 2017-06-01 664
1 2017-06-02 658
2 2017-06-03 477
3 2017-06-04 510
4 2017-06-05 893
In [38]:
plt.figure(figsize=(15,8))
sns.lineplot(x='session_date', y='uid', data=daily_sessions,linewidth = 2).set_title('Average Number Of Sessions Per Day'
                                                               , fontdict={'size':15})


plt.xlabel("date")

plt.ylabel("Number Of Average Sessions")
plt.axhline(avg_daily_ses,color = 'red', linewidth= 1,label ='Overall Average')
plt.legend()

plt.grid()
plt.show()

This graph is almost identical to the 'Number Of Active Unique Users Each Day' graph from the last section. The number of sessions per day is very close to the number of active users per day.

Now I will plot both graphs together to see how close they are.

In [39]:
plt.figure(figsize=(15,6))
sns.lineplot(x='session_date', y='uid', data=daily_sessions,linewidth = 1).set_title('Average Number Of Sessions Vs Number Of Active Unique Users Per Day'
                                                               , fontdict={'size':15})

sns.lineplot(x='session_date', y='uid', data=dau_total,linewidth = 1)

plt.xlabel("date")

plt.ylabel("Number Of Average Sessions/ Users")

plt.grid()
plt.show()

Here we can really see that the two graphs are almost identical. One interesting thing to note is that at the high peak the number of sessions rose noticeably above the number of active users, which means that some users opened an unusually high number of sessions.
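Dividing sessions by unique users per day quantifies "some users had many sessions". A sketch on a tiny synthetic frame standing in for the real `visits`:

```python
import pandas as pd

visits_demo = pd.DataFrame({
    'session_date': pd.to_datetime(['2017-11-24'] * 5 + ['2017-11-25'] * 2),
    'uid': [1, 1, 1, 2, 3, 4, 5],
})
per_day = (visits_demo.groupby('session_date')['uid']
           .agg(sessions='count', users='nunique'))
per_day['sessions_per_user'] = per_day['sessions'] / per_day['users']
print(per_day)
# On 2017-11-24, 3 users opened 5 sessions: about 1.67 sessions per user.
```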

In [40]:
# Adding a column with avg number of sessions per user per day
ses_per_user = daily_sessions
ses_per_user['avg_per_user'] = ses_per_user['uid']/dau_total['uid'] 
ses_per_user
Out[40]:
session_date uid avg_per_user
0 2017-06-01 664 1.097521
1 2017-06-02 658 1.082237
2 2017-06-03 477 1.071910
3 2017-06-04 510 1.071429
4 2017-06-05 893 1.089024
... ... ... ...
359 2018-05-27 672 1.083871
360 2018-05-28 1156 1.112608
361 2018-05-29 1035 1.091772
362 2018-05-30 1410 1.093871
363 2018-05-31 2256 1.129695

364 rows × 3 columns

In [41]:
plt.figure(figsize=(15,8))
sns.lineplot(x='session_date', y='avg_per_user', data=ses_per_user,linewidth = 2).set_title('Average Number Of Sessions Per Day Per User'
                                                               , fontdict={'size':15})
plt.xlabel("date")

plt.ylabel("Number Of Average Sessions")

plt.grid()
plt.show()

This graph shows the average number of sessions per user per day. The high spike we saw before around Christmas time is still there, but when we look at sessions per user there are actually two spikes around Christmas.
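To pin down the exact spike dates, `nlargest` on the per-day series is enough. A sketch with synthetic rows standing in for the `ses_per_user` frame built above:

```python
import pandas as pd

ses_demo = pd.DataFrame({
    'session_date': pd.to_datetime(['2017-12-17', '2017-12-24', '2018-03-01']),
    'avg_per_user': [1.25, 1.31, 1.08],
})
spikes = ses_demo.nlargest(2, 'avg_per_user')  # top-2 spike days
print(spikes)
```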

Now I will make a graph that shows the average number of sessions per day per device.

In [42]:
# making the pivot table

visits_by_device = visits.pivot_table(index= 'session_date', columns ='device', values='uid',aggfunc='count')
visits_by_device.columns = visits_by_device.columns.categories
visits_by_device.reset_index(inplace=True)

visits_by_device.head()
Out[42]:
session_date desktop touch
0 2017-06-01 501.0 163.0
1 2017-06-02 498.0 160.0
2 2017-06-03 329.0 148.0
3 2017-06-04 349.0 161.0
4 2017-06-05 693.0 200.0
In [43]:
plt.figure(figsize=(10,6))
plt.plot(visits_by_device.session_date,visits_by_device.desktop, label='Desktop')
plt.plot(visits_by_device.session_date,visits_by_device.touch, label='Touch')

plt.xlabel("date")


plt.ylabel("Number Of Average Users")
plt.title('Average Visits By Device')


plt.legend()
plt.grid()
plt.show()

In this graph we can see the average visits from the two device types, Desktop and Touch. As we saw earlier, desktop brings in many more visits, but overall the two curves behave similarly.

What is the length of each session?

In [44]:
session_time = visits[['start_ts','end_ts','session_date','uid','device']]
session_time['sesion_length']=session_time['end_ts']-session_time['start_ts']
session_time['sesion_length']=session_time['sesion_length'].dt.seconds
In [45]:
avg_session_time = session_time['sesion_length'].mean()
print('the average session is {} seconds long or about {} minutes long'.format(int(avg_session_time),int((avg_session_time)/60)))    
the average session is 643 seconds long or about 10 minutes long
In [46]:
session_time_group = session_time.groupby(['session_date'],as_index=False).agg({'sesion_length':'mean'})

plt.figure(figsize=(10,6))
plt.plot(session_time_group.session_date,session_time_group.sesion_length)


plt.xlabel("date")
plt.ylabel("Seconds")
plt.title('Average Session length')
plt.axhline(avg_session_time,color = 'red', linewidth= 1,label ='Overall Average')
plt.legend()

plt.grid()
plt.show()

Once again the graph shows that session lengths follow the same pattern as the other metrics we have looked at so far: the same spike around Christmas and the same dip around April (someone should really look into what happened that April). All in all, this graph is more even across the year than the others.

I also want to compare platforms and see if the average session time changes.

In [47]:
session_time_desktop = session_time.query('device =="desktop"')
session_time_touch = session_time.query('device =="touch"')
average_session_time_desktop = session_time_desktop['sesion_length'].mean()
average_session_time_touch =session_time_touch['sesion_length'].mean()
average_session_time_desktop
average_session_time_touch
print('The average session on a desktop is: {} seconds\nThe average session on a touch is: {} seconds'.format(average_session_time_desktop,average_session_time_touch))
The average session on a desktop is: 703.9469773429258 seconds
The average session on a touch is: 479.6194066072517 seconds

We can see that people on desktop spend almost 47% more time on the site than people on touch devices (704 vs 480 seconds). Is that good or bad? It seems good, but it is a bit too early to say for sure. Maybe the site on touch devices is simply more streamlined? These are interesting questions that need to be asked.
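Whether a gap like this is statistically meaningful can be checked with a rank-based test; session lengths are heavily right-skewed, so a rank test is a better fit than a t-test. A sketch on synthetic samples (on the real data the two inputs would be the `sesion_length` columns of the desktop and touch subsets):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
# synthetic right-skewed session lengths; desktop drawn longer on average
desktop = rng.exponential(scale=700, size=5000)
touch = rng.exponential(scale=480, size=3000)

stat, p_value = st.mannwhitneyu(desktop, touch, alternative='two-sided')
print('p-value: {:.3g}'.format(p_value))
```

A p-value below the usual 0.05 threshold would suggest the device gap is not just noise.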

In [48]:
plt.figure(figsize=(15,6))
plt.hist(session_time.sesion_length,bins=11,range=(300,900),color='c',label='Data')
plt.axvline(session_time.sesion_length.mean(), color='k', linestyle='dashed',label='Average')
plt.axvline(session_time.sesion_length.median(), color='r', linestyle='dashed',label='Median')


plt.xlabel("Session Time (sec)")
plt.ylabel("Number Of Sessions")
plt.title('Session length')
plt.legend()
plt.show()

The histogram above shows the distribution of session times. The average session time is a bit more than 600 seconds (10 min), and the median sits below it. The shape is the tail of an exponential-like distribution: most sessions are shorter than the average, and as session time grows the sessions become fewer and fewer, down to a handful of very long but very rare outliers.
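For a right-skewed distribution like this, the mean is pulled up by the long tail, so the median is usually the better "typical session" figure. A sketch with a synthetic exponential sample (scale chosen near the observed 640-second mean):

```python
import numpy as np

rng = np.random.default_rng(1)
lengths = rng.exponential(scale=640, size=100_000)  # synthetic skewed sample

print('mean:   {:.0f} s'.format(lengths.mean()))
print('median: {:.0f} s'.format(np.median(lengths)))
# For an exponential, median = scale * ln(2), so the mean exceeds the median.
```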

How often do users come back?

To answer this I will find the retention by month.

In [49]:
visits_first_1 = visits.groupby(['uid'])['start_ts'].min()
visits_first_1.name = 'first_activity_date'
visits_big = visits
visits_big = visits_big.join(visits_first_1,on='uid')
In [50]:
visits_big['session_month']= visits_big['session_date'].astype('datetime64[M]')
visits_big['first_session_month'] = visits_big['first_activity_date'].astype('datetime64[M]')





visits_big['cohort_lifetime']= visits_big['session_month']- visits_big['first_session_month']
visits_big['cohort_lifetime'] = visits_big['cohort_lifetime']/ np.timedelta64(1,'M')
visits_big['cohort_lifetime'] = visits_big['cohort_lifetime'].round().astype(int)

cohorts = visits_big.groupby(['first_session_month','cohort_lifetime']).agg({'uid':'nunique'}).reset_index()
initial_users_count = cohorts[cohorts['cohort_lifetime']==0][['first_session_month','uid']]
initial_users_count = initial_users_count.rename(columns={'uid':'cohort_users'})
cohorts = cohorts.merge(initial_users_count,on='first_session_month')
cohorts['retention'] =cohorts['uid']/cohorts['cohort_users']
cohorts['just_date'] = cohorts['first_session_month'].dt.date


retention_pivot = cohorts.pivot_table(index='just_date',columns='cohort_lifetime', values='retention', aggfunc='sum')
In [51]:
retention_pivot
Out[51]:
cohort_lifetime 0 1 2 3 4 5 6 7 8 9 10 11
just_date
2017-06-01 1.0 0.078664 0.053775 0.061392 0.068557 0.071423 0.061015 0.057772 0.052342 0.050833 0.040652 0.044951
2017-07-01 1.0 0.056088 0.051294 0.056164 0.058219 0.048174 0.045358 0.045738 0.038813 0.028615 0.027473 NaN
2017-08-01 1.0 0.076908 0.062862 0.062764 0.050093 0.044004 0.036342 0.039485 0.027895 0.026029 NaN NaN
2017-09-01 1.0 0.085489 0.069205 0.050706 0.039392 0.037835 0.035860 0.024186 0.022809 NaN NaN NaN
2017-10-01 1.0 0.078608 0.052239 0.038958 0.034261 0.032221 0.021365 0.020364 NaN NaN NaN NaN
2017-11-01 1.0 0.078281 0.044113 0.038682 0.033727 0.023415 0.021800 NaN NaN NaN NaN NaN
2017-12-01 1.0 0.055802 0.037993 0.031107 0.020263 0.019036 NaN NaN NaN NaN NaN NaN
2018-01-01 1.0 0.059715 0.039339 0.024973 0.020244 NaN NaN NaN NaN NaN NaN NaN
2018-02-01 1.0 0.057080 0.025454 0.020093 NaN NaN NaN NaN NaN NaN NaN NaN
2018-03-01 1.0 0.041818 0.027053 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-04-01 1.0 0.048380 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-05-01 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [52]:
sns.set(style='white')
plt.figure(figsize=(15, 8))

cmap = mcolors.LinearSegmentedColormap.from_list(name="n",colors=['azure','purple','blue','green','yellow','orange','red'])

sns.heatmap(retention_pivot, annot=True, fmt='.1%', linewidths=1, linecolor='gray',cmap=cmap)


plt.title('Cohorts: User Retention', fontdict={'size':15})
plt.xlabel("Retention")
plt.ylabel("First Session")
plt.show()

We can see in the heatmap that the first cohort had a rise between its 4th and 5th months; apart from that, every cohort declines steadily. We can also see that the cohorts end their last observed month at around 2%-3% retention (excluding the first and last cohorts). This is worrying because newer cohorts reach that level faster and faster: the 2017-07 cohort took about 9 months to fall that low, while the 2018 cohorts were already near 2% within 2-3 months. Users seem to be churning more and more quickly.
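Averaging the retention pivot column-wise gives a one-line view of how retention decays with lifetime across cohorts. A sketch on a tiny synthetic pivot shaped like `retention_pivot` above (the NaNs are the not-yet-observed months, which pandas skips by default):

```python
import numpy as np
import pandas as pd

retention_demo = pd.DataFrame(
    {0: [1.0, 1.0, 1.0],
     1: [0.079, 0.056, 0.077],
     2: [0.054, 0.051, np.nan]},
    index=pd.to_datetime(['2017-06-01', '2017-07-01', '2017-08-01']),
)
mean_retention = retention_demo.mean()  # column-wise mean, skipna=True
print(mean_retention.round(4))
```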

Sales

When do people start buying?

I will calculate the conversion rate.

In [53]:
# Grouping by user and first purchase date
orders_grouped = orders.groupby(['uid'])['buy_ts'].min()
orders_grouped.name = 'first_buy_date'
orders_grouped.head()
Out[53]:
uid
313578113262317    2018-01-03 21:51:00
1575281904278712   2017-06-03 10:13:00
2429014661409475   2017-10-11 18:33:00
2464366381792757   2018-01-28 15:54:00
2551852515556206   2017-11-24 10:14:00
Name: first_buy_date, dtype: datetime64[ns]
In [54]:
visits_first = visits_big[['uid','first_activity_date','cohort_lifetime','device','first_session_month','source_id']]

visits_first = visits_first.merge(orders_grouped, on='uid',how='outer')
visits_first['first_activity_date'] = visits_first['first_activity_date'].dt.date
visits_first['first_buy_date'] = visits_first['first_buy_date'].dt.date
visits_first['convertion'] = visits_first['first_buy_date']- visits_first['first_activity_date']
In [55]:
visits_first.drop_duplicates(subset='uid',inplace=True)
visits_first.reset_index(inplace=True, drop=True)
visits_first.head(10)
Out[55]:
uid first_activity_date cohort_lifetime device first_session_month source_id first_buy_date convertion
0 16879256277535980062 2017-12-20 0 touch 2017-12-01 4 NaT NaT
1 104060357244891740 2018-02-19 0 desktop 2018-02-01 2 NaT NaT
2 7459035603376831527 2017-07-01 0 touch 2017-07-01 5 NaT NaT
3 16174680259334210214 2018-03-09 2 desktop 2018-03-01 9 2018-03-09 0 days
4 9969694820036681168 2017-12-27 0 desktop 2017-12-01 3 NaT NaT
5 16007536194108375387 2017-09-03 0 desktop 2017-09-01 5 2017-09-04 1 days
6 6661610529277171451 2017-06-29 7 desktop 2017-06-01 1 NaT NaT
7 11423865690854540312 2017-11-05 0 touch 2017-11-01 3 NaT NaT
8 2987360259350925644 2017-07-19 0 desktop 2017-07-01 3 NaT NaT
9 1289240080042562063 2017-11-08 0 desktop 2017-11-01 5 NaT NaT
In [56]:
no_convertion = len(visits_first.query('convertion == "nan"'))
print('% of users that did not convert at all is : {}'.format(no_convertion/len(visits_first)))
% of users that did not convert at all is : 0.8399300518475341

A very large share of users (84%) did not convert at all.

In [57]:
desktop_buy = visits_first.query('device == "desktop"')
touch_buy = visits_first.query('device == "touch"')
desktop_no_convertion= len(desktop_buy.query('convertion == "nan"'))
touch_no_convertion = len(touch_buy.query('convertion == "nan"'))

print('% of users on a desktop that did not convert at all is : {}'.format(desktop_no_convertion/len(desktop_buy)))
print('% of users on a touch device that did not convert at all is : {}'.format(touch_no_convertion/len(touch_buy)))
% of users on a desktop that did not convert at all is : 0.8184371252648387
% of users on a touch device that did not convert at all is : 0.8913897236023921

We can see that users who came to the site on a touch device converted less than users on desktop. The difference is significant: around 7 percentage points (89% of touch users never converted vs 82% of desktop users).

In [58]:
convertions= visits_first.dropna(subset=['convertion']).reset_index(drop=True)
convertions['convertion']=convertions['convertion'].dt.days.astype('int16')
convertions.head()
Out[58]:
uid first_activity_date cohort_lifetime device first_session_month source_id first_buy_date convertion
0 16174680259334210214 2018-03-09 2 desktop 2018-03-01 9 2018-03-09 0
1 16007536194108375387 2017-09-03 0 desktop 2017-09-01 5 2017-09-04 1
2 8056418121947262981 2017-06-04 8 desktop 2017-06-01 4 2017-06-25 21
3 18188358787673499603 2018-02-10 0 touch 2018-02-01 2 2018-02-12 2
4 2307401184189569202 2017-09-23 3 desktop 2017-09-01 4 2017-09-27 4
In [59]:
convertion_0 = convertions.query('convertion == 0')
convertion_week = convertions.query('convertion > 0 & convertion < 8')
convertion_2weeks = convertions.query('convertion >= 8 & convertion < 15')
convertion_month =  convertions.query('convertion >= 15 & convertion <= 30')
convertion_over_month =convertions.query('convertion > 30')
In [60]:
print('% of users that converted on first day: {}'.format(len(convertion_0)/len(convertions)))
print('% of users that converted on first week: {}'.format(len(convertion_week)/len(convertions)))
print('% of users that converted on second week: {}'.format(len(convertion_2weeks)/len(convertions)))
print('% of users that converted on third or fourth week: {}'.format(len(convertion_month)/len(convertions)))
print('% of users that converted after the first month: {}'.format(len(convertion_over_month)/len(convertions)))
% of users that converted on first day: 0.6855679982476796
% of users that converted on first week: 0.11765189058949155
% of users that converted on second week: 0.026668126933713003
% of users that converted on third or fourth week: 0.03501902910494752
% of users that converted after the first month: 0.13509295512416833

When a user converts, it usually happens on the first day (68%). Another 12% of converted users do so within their first week (excluding the first day), so close to 80% of conversions happen within the first week. A user who did not convert in the first week has only about a 6% chance of converting within the first month, but there is still hope, because 13.5% of converted users convert after the first month.
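The five manual `query` buckets above can also be expressed in a single `pd.cut` call, which keeps the bin edges in one place. A sketch on synthetic conversion delays in days (on the real data this would be the `convertion` column):

```python
import pandas as pd

delays = pd.Series([0, 0, 1, 3, 9, 16, 45, 0, 2, 31])
buckets = pd.cut(delays,
                 bins=[-1, 0, 7, 14, 30, delays.max()],  # right-closed bins
                 labels=['same day', 'first week', 'second week',
                         'first month', 'over a month'])
share = buckets.value_counts(normalize=True).sort_index()
print(share)
```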

Next I will look at each cohort and find the percentage of converted users who converted within the first week.

In [61]:
convertion_1st_week = convertions.query('convertion < 8')
convertion_cohorts_1st_week = convertion_1st_week.groupby(['first_session_month']).agg({'uid':'nunique'}).reset_index()
convertion_cohorts_1st_week.rename(columns={'uid':'convertions'},inplace=True)

convertion_cohorts = convertions.groupby(['first_session_month']).agg({'uid':'nunique'}).reset_index()


convertion_cohorts= convertion_cohorts.merge(convertion_cohorts_1st_week,on='first_session_month')
convertion_cohorts['first_session_month']= convertion_cohorts['first_session_month'].astype(str)
convertion_cohorts['% convertion']= convertion_cohorts['convertions']/convertion_cohorts['uid']
convertion_cohorts
Out[61]:
first_session_month uid convertions % convertion
0 2017-06-01 2923 1978 0.676702
1 2017-07-01 2458 1697 0.690399
2 2017-08-01 1721 1188 0.690296
3 2017-09-01 3058 2220 0.725965
4 2017-10-01 4678 3660 0.782386
5 2017-11-01 4262 3272 0.767715
6 2017-12-01 4074 3364 0.825724
7 2018-01-01 3119 2599 0.833280
8 2018-02-01 3186 2790 0.875706
9 2018-03-01 2838 2584 0.910500
10 2018-04-01 1890 1733 0.916931
11 2018-05-01 2316 2251 0.971934
In [62]:
plt.figure(figsize=(15,8))
ax= sns.barplot(x='first_session_month', y='% convertion', data=convertion_cohorts)
ax.set_title('First Week Conversion By Cohort (converted only)', fontdict={'size':15})

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.2f'),
                   (p.get_x() + p.get_width() / 2.,
                    p.get_height()), ha = 'center', va = 'center',
                   xytext = (0, -12), textcoords = 'offset points',fontsize=14,color='w')


plt.xticks(rotation=30)
plt.xlabel("Cohort Month")
plt.ylabel("% Conversion")
plt.show()

We can see in the graph above that with each new cohort, the conversion rate within the first week goes up. This is, of course, only looking at users who have converted.

Next I will do the same, but this time look at the percentage of users who converted within the first week out of the total number of users in the cohort (not only the converted users).

In [63]:
all_visits_cohorts = visits_first.groupby('first_session_month').agg({'uid':'nunique'}).reset_index()
all_visits_cohorts = all_visits_cohorts.merge(convertion_cohorts_1st_week, on='first_session_month')

convertions_total = visits_first[visits_first['convertion'].notna()]  # keep only users that converted at some point
convertions_total_cohorts = convertions_total.groupby('first_session_month').agg({'uid':'nunique'}).reset_index()
convertions_total_cohorts.rename(columns={'uid':'total_convertions'},inplace=True)

all_visits_cohorts = all_visits_cohorts.merge(convertions_total_cohorts, on='first_session_month')

all_visits_cohorts['first_session_month'] = all_visits_cohorts['first_session_month'].astype(str)
all_visits_cohorts['% convertions'] =all_visits_cohorts['convertions']/all_visits_cohorts['uid']
all_visits_cohorts['% total_convertions'] =all_visits_cohorts['total_convertions']/all_visits_cohorts['uid']
In [64]:
plt.figure(figsize=(15,10))
ax = sns.barplot(x='first_session_month', y='% total_convertions', data=all_visits_cohorts, color = "red")
ax.set_title('Conversion by Cohort (all users)', fontdict={'size':15})




bottom_plot = sns.barplot(x = 'first_session_month', y = '% convertions',data=all_visits_cohorts, color = "#0000A3")




topbar = plt.Rectangle((0,0),1,1,fc="red", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='#0000A3',  edgecolor = 'none')
l = plt.legend([bottombar, topbar], ['First Week Convertions', 'Total Convertions'], loc=1, ncol = 2, prop={'size':16})

plt.xticks(rotation=30)
plt.xlabel("Cohort Month")
plt.ylabel("% of conversions")
plt.show()

This graph shows, for each cohort, the total conversion rate and the share of users who converted within the first week. The first cohort leads in total conversions, while the first and last cohorts lead in first-week conversions. Total conversion trends downward from the first cohort to the second-to-last, but first-week conversion fluctuates while staying roughly stable around 0.13 (13%).

In [65]:
def convertion_time(row):
    days = row['convertion']  # days between first session and first purchase
    if days == 0:
        return '00d'
    elif days < 8:
        return '07d'
    elif days < 15:
        return '14d'
    elif days < 31:
        return '30d'
    else:
        return '31+d'


convertions['convertion_days'] = convertions.apply(convertion_time, axis=1)
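As a side note, the same binning can be done without a row-wise `apply` by using `pd.cut`, which is usually faster on large frames. A minimal sketch on hypothetical delay values (not the project data), using the same buckets as `convertion_time()`:

```python
import pandas as pd

# Hypothetical conversion delays (days between first session and first purchase)
days = pd.Series([0, 1, 21, 2, 4, 45])

# Same buckets as convertion_time():
# 0 -> '00d', 1-7 -> '07d', 8-14 -> '14d', 15-30 -> '30d', 31+ -> '31+d'
labels = ['00d', '07d', '14d', '30d', '31+d']
bucket = pd.cut(days, bins=[-1, 0, 7, 14, 30, float('inf')], labels=labels)

print(bucket.tolist())  # ['00d', '07d', '30d', '07d', '07d', '31+d']
```

Each bin is closed on the right, so `bins=[-1, 0, ...]` puts the value 0 into its own `'00d'` bucket.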
In [66]:
convertions.head()
Out[66]:
uid first_activity_date cohort_lifetime device first_session_month source_id first_buy_date convertion convertion_days
0 16174680259334210214 2018-03-09 2 desktop 2018-03-01 9 2018-03-09 0 00d
1 16007536194108375387 2017-09-03 0 desktop 2017-09-01 5 2017-09-04 1 07d
2 8056418121947262981 2017-06-04 8 desktop 2017-06-01 4 2017-06-25 21 30d
3 18188358787673499603 2018-02-10 0 touch 2018-02-01 2 2018-02-12 2 07d
4 2307401184189569202 2017-09-23 3 desktop 2017-09-01 4 2017-09-27 4 07d
In [67]:
convertions_groupd = convertions.groupby(['first_session_month','convertion_days']).agg({'uid':'nunique'}).reset_index()

convertions_groupd = convertions_groupd.merge(initial_users_count,on='first_session_month')
convertions_groupd['%convertion'] = convertions_groupd['uid']/convertions_groupd['cohort_users']
convertions_groupd['just_date'] = convertions_groupd['first_session_month'].dt.date
convertions_groupd.head()
Out[67]:
first_session_month convertion_days uid cohort_users %convertion just_date
0 2017-06-01 00d 1716 13259 0.129422 2017-06-01
1 2017-06-01 07d 262 13259 0.019760 2017-06-01
2 2017-06-01 14d 52 13259 0.003922 2017-06-01
3 2017-06-01 30d 97 13259 0.007316 2017-06-01
4 2017-06-01 31+d 796 13259 0.060035 2017-06-01
In [68]:
convertions_pivot = convertions_groupd.pivot_table(index='just_date',columns='convertion_days',values='%convertion',aggfunc='sum')
convertions_pivot = convertions_pivot.sort_index(axis='columns')  # order the day buckets left to right
convertions_pivot
Out[68]:
convertion_days 00d 07d 14d 30d 31+d
just_date
2017-06-01 0.129422 0.019760 0.003922 0.007316 0.060035
2017-07-01 0.111492 0.017656 0.003881 0.004338 0.049696
2017-08-01 0.104017 0.012671 0.003634 0.006090 0.042628
2017-09-01 0.112488 0.020414 0.004969 0.008561 0.036638
2017-10-01 0.120876 0.020018 0.004619 0.007122 0.027447
2017-11-01 0.099310 0.020772 0.006019 0.007083 0.023231
2017-12-01 0.112039 0.021094 0.004235 0.005580 0.018284
2018-01-01 0.096667 0.018211 0.003978 0.005790 0.013216
2018-02-01 0.106321 0.019372 0.004370 0.005046 0.008425
2018-03-01 0.109087 0.016417 0.003303 0.004080 0.004954
2018-04-01 0.093959 0.016360 0.003628 0.003628 0.002737
2018-05-01 0.129248 0.018137 0.003143 0.001113 NaN
In [69]:
sns.set(style='white')
plt.figure(figsize=(13, 9))

#cmap = mcolors.LinearSegmentedColormap.from_list(name="n",colors=['azure','purple','blue','green','yellow','orange','red'])

sns.heatmap(convertions_pivot, annot=True, fmt='.1%', linewidths=1, linecolor='gray')


plt.title('User Conversion %', fontdict={'size':15})
plt.xlabel("Conversion Days")
plt.ylabel("First Session")
plt.show()

This heatmap shows the conversion rate for each cohort. The first cohort is the strongest overall, and it is clear that most conversions happen on the first day. All cohorts behave similarly: if a user does not convert within the first week, they almost never convert later in the first month. The last column (31+ days) is not directly comparable, because it has no upper time limit, so the first cohort will naturally accumulate more conversions there than the second-to-last one. What it does show is that users keep converting even after the first month; these late converters can add up to a substantial share.

How many orders do they make during a given period of time?

In [70]:
orders['order_date']= orders['buy_ts'].dt.date
orders['order_week']= orders['buy_ts'].dt.week
orders['order_month']= orders['buy_ts'].dt.month
In [71]:
avg_daily_purchases = int(orders.groupby('order_date').agg({'uid':'count'}).mean())
print('The average number of daily purchases is: {}'.format(avg_daily_purchases))
The average number of daily purchases is: 138
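The count-then-average pattern used above can be seen on a toy frame with hypothetical uids and dates (not the project data); comparing the mean with the median is a cheap check on whether seasonal spikes are inflating the average:

```python
import pandas as pd

# Hypothetical orders: one row per purchase
toy = pd.DataFrame({
    'uid':        [1, 2, 3, 4, 5, 6],
    'order_date': ['2017-06-01', '2017-06-01', '2017-06-01',
                   '2017-06-02', '2017-06-02', '2017-06-03'],
})

daily = toy.groupby('order_date')['uid'].count()  # purchases per day: 3, 2, 1
print(daily.mean())    # 2.0
print(daily.median())  # 2.0
```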
In [72]:
orders_source = pd.merge(orders,visits_first[['source_id','first_activity_date','cohort_lifetime','uid']],on='uid')
orders_source.head()
Out[72]:
buy_ts revenue uid order_date order_week order_month source_id first_activity_date cohort_lifetime
0 2017-06-01 00:10:00 17.00 10329302124590727494 2017-06-01 22 6 1 2017-06-01 0
1 2017-06-01 00:25:00 0.55 11627257723692907447 2017-06-01 22 6 2 2017-06-01 0
2 2017-06-01 00:27:00 0.37 17903680561304213844 2017-06-01 22 6 2 2017-06-01 0
3 2017-06-01 00:29:00 0.55 16109239769442553005 2017-06-01 22 6 2 2017-06-01 0
4 2017-06-01 07:58:00 0.37 14200605875248379450 2017-06-01 22 6 3 2017-06-01 0
In [73]:
orders_source_grouped = orders_source.groupby(['source_id','order_month']).agg({'uid':'count'}).reset_index()
orders_source_grouped.head()
Out[73]:
source_id order_month uid
0 1 1 620
1 1 2 669
2 1 3 690
3 1 4 473
4 1 5 624
In [74]:
fig=px.line(orders_source_grouped, x='order_month', y='uid', color='source_id',
            labels={'order_month':'Month','uid':'Purchases'},
            title="Average Number of Purchases Per Month Per Ad Source")

fig.update_layout(
    autosize=False,
    width=1000,
    height=700)


fig.show()

The graph above shows the number of purchases each month for every ad source. Ad source 4 is almost constantly in the lead, except in month 8, when ad source 5 takes over.

Ad sources 7 and 6 bring essentially nothing, so we can get rid of them. Sources 9 and 10 likewise bring almost no purchases at all.

In general there is a drop in the middle of the year, after which purchases rise and nearly triple toward the holiday season at the end of the year.

What is the average purchase size?

In [75]:
avg_purchas= orders['revenue'].mean()
print('Average of all purchases is: {:.2f}'.format(avg_purchas))
Average of all purchases is: 5.00

Now I'll calculate the average purchase size for each day.

In [76]:
orders_revenue=orders_source.groupby('order_date',as_index=False).agg({'revenue':'mean'})
orders_revenue.head()
Out[76]:
order_date revenue
0 2017-06-01 4.056875
1 2017-06-02 2.919910
2 2017-06-03 2.484776
3 2017-06-04 2.957727
4 2017-06-05 2.742174
In [77]:
plt.figure(figsize=(15,8))
sns.lineplot(data=orders_revenue,x='order_date',y='revenue',linewidth='2',alpha=0.8)
plt.grid()

plt.title('Average Purchases Amount',fontdict={'size':20})
plt.xlabel("Date")
plt.ylabel("Purchase Amount")
plt.axhline(avg_purchas,color = 'red', linewidth= 2,label ='Overall Average',alpha=0.5)
plt.legend(frameon=False,fontsize=15)

plt.show()

It's no surprise that there is a huge jump in the average purchase size in the days before Christmas. The graph also has other peaks that probably correspond to holidays, though no holiday compares with Christmas.

Now, as before, let's look at the average purchase amount per month for each ad source.

In [78]:
order_source_revenue = orders_source.groupby(['source_id','order_month']).agg({'revenue':'mean'}).reset_index()

order_source_revenue.head()
Out[78]:
source_id order_month revenue
0 1 1 5.261758
1 1 2 4.836622
2 1 3 6.975884
3 1 4 7.523467
4 1 5 5.538221
In [79]:
fig=px.line(order_source_revenue, x='order_month', y='revenue', color='source_id',
            labels={'order_month':'Month','revenue':'Purchases Sum'},
            title="Average Sum of Purchases Per Month Per Ad Source")

fig.update_layout(
    autosize=False,
    width=1000,
    height=700)


fig.show()

This graph is very interesting when compared with "Average Number of Purchases Per Month Per Ad Source". Here we see that most of the Christmas-time spike comes from a single source (number 2). And while source 4 led in the number of purchases, here it sits around fourth place for most of the year in purchase sum. It may bring many paying customers, but they buy in smaller amounts relative to other sources. So at the top we have source 2, with source 1 behind it.

Sources 9 and 10, which did not bring many purchases, show more promise in purchase sums, but are still at the lower end.

How much money do they bring?

In [80]:
orders['order_month_datetime'] = orders['order_date'].astype('datetime64[M]')
orders_first = orders.groupby('uid').agg({'order_month_datetime':'min'}).reset_index()
orders_first.columns = ('uid','first_order_month')
orders_cohort_size = orders_first.groupby('first_order_month').agg({'uid':'nunique'}).reset_index()
orders_cohort_size.columns = ('first_order_month','n_buyers')
orders_= pd.merge(orders,orders_first, on='uid')
cohorts_orders = orders_.groupby(['first_order_month','order_month_datetime']).agg({'revenue':'sum'}).reset_index()

cohorts_orders.sample(10)
Out[80]:
first_order_month order_month_datetime revenue
76 2018-04-01 2018-05-01 1209.92
48 2017-10-01 2018-04-01 368.10
46 2017-10-01 2018-02-01 657.77
9 2017-06-01 2018-03-01 1225.51
4 2017-06-01 2017-10-01 2068.58
34 2017-09-01 2017-10-01 2884.49
29 2017-08-01 2018-02-01 288.61
65 2018-01-01 2018-03-01 1026.34
15 2017-07-01 2017-10-01 690.31
39 2017-09-01 2018-03-01 1810.85
In [81]:
report = pd.merge(orders_cohort_size, cohorts_orders, on='first_order_month')
report.head()
Out[81]:
first_order_month n_buyers order_month_datetime revenue
0 2017-06-01 2023 2017-06-01 9557.49
1 2017-06-01 2023 2017-07-01 981.82
2 2017-06-01 2023 2017-08-01 885.34
3 2017-06-01 2023 2017-09-01 1931.30
4 2017-06-01 2023 2017-10-01 2068.58
In [82]:
margin_rate =.5

report['gp']= report['revenue']*margin_rate
report['age']= (report['order_month_datetime']-report['first_order_month'])/np.timedelta64(1, 'M')
report['age'] = report['age'].round().astype('int')
report['ltv'] = report['gp'] / report['n_buyers']
report['just_date'] = report['first_order_month'].dt.date

report.head(20)
Out[82]:
first_order_month n_buyers order_month_datetime revenue gp age ltv just_date
0 2017-06-01 2023 2017-06-01 9557.49 4778.745 0 2.362207 2017-06-01
1 2017-06-01 2023 2017-07-01 981.82 490.910 1 0.242664 2017-06-01
2 2017-06-01 2023 2017-08-01 885.34 442.670 2 0.218819 2017-06-01
3 2017-06-01 2023 2017-09-01 1931.30 965.650 3 0.477336 2017-06-01
4 2017-06-01 2023 2017-10-01 2068.58 1034.290 4 0.511265 2017-06-01
5 2017-06-01 2023 2017-11-01 1487.92 743.960 5 0.367751 2017-06-01
6 2017-06-01 2023 2017-12-01 1922.74 961.370 6 0.475220 2017-06-01
7 2017-06-01 2023 2018-01-01 1176.56 588.280 7 0.290796 2017-06-01
8 2017-06-01 2023 2018-02-01 1119.15 559.575 8 0.276607 2017-06-01
9 2017-06-01 2023 2018-03-01 1225.51 612.755 9 0.302894 2017-06-01
10 2017-06-01 2023 2018-04-01 1155.66 577.830 10 0.285630 2017-06-01
11 2017-06-01 2023 2018-05-01 519.62 259.810 11 0.128428 2017-06-01
12 2017-07-01 1923 2017-07-01 11557.65 5778.825 0 3.005109 2017-07-01
13 2017-07-01 1923 2017-08-01 644.61 322.305 1 0.167605 2017-07-01
14 2017-07-01 1923 2017-09-01 1199.05 599.525 2 0.311765 2017-07-01
15 2017-07-01 1923 2017-10-01 690.31 345.155 3 0.179488 2017-07-01
16 2017-07-01 1923 2017-11-01 339.97 169.985 4 0.088396 2017-07-01
17 2017-07-01 1923 2017-12-01 300.08 150.040 5 0.078024 2017-07-01
18 2017-07-01 1923 2018-01-01 231.16 115.580 6 0.060104 2017-07-01
19 2017-07-01 1923 2018-02-01 272.72 136.360 7 0.070910 2017-07-01
In [83]:
output = report.pivot_table(
    index='just_date', 
    columns='age', 
    values='ltv', 
    aggfunc='mean')

output.fillna('') 
Out[83]:
age 0 1 2 3 4 5 6 7 8 9 10 11
just_date
2017-06-01 2.362207 0.242664 0.218819 0.477336 0.511265 0.367751 0.47522 0.290796 0.276607 0.302894 0.28563 0.128428
2017-07-01 3.005109 0.167605 0.311765 0.179488 0.0883957 0.0780239 0.060104 0.07091 0.0806162 0.0735725 0.0778367
2017-08-01 2.638259 0.235996 0.229241 0.195639 0.247026 0.14177 0.105332 0.202504 0.146106 0.0939891
2017-09-01 2.822265 0.558793 0.260465 1.9879 0.200279 0.323534 0.350804 0.121575 0.0920031
2017-10-01 2.501866 0.267881 0.095697 0.0785726 0.07578 0.0601809 0.0424078 0.057735
2017-11-01 2.577341 0.199617 0.0997782 0.162476 0.0740064 0.0269395 0.0574639
2017-12-01 2.369095 0.130187 0.462548 0.532637 0.156465 0.169023
2018-01-01 2.067818 0.147379 0.152141 0.071389 0.0313489
2018-02-01 2.078494 0.139137 0.0392577 0.037072
2018-03-01 2.419401 0.150446 0.15778
2018-04-01 2.328798 0.2658
2018-05-01 2.330281
2018-06-01 1.710000
In [84]:
sns.set(style='white')
plt.figure(figsize=(13, 9))

cmap = mcolors.LinearSegmentedColormap.from_list(name="n",colors=['azure','purple','blue','green','yellow','orange','red'])

sns.heatmap(output, annot=True,  linewidths=1, linecolor='gray',cmap='RdBu_r')


plt.title('Cohorts: LTV', fontdict={'size':15})
plt.xlabel("Cohort Age (months)")
plt.ylabel("First Order Month")
plt.show()

We can see that each cohort's LTV is high in the first month and then drops drastically. I believe most users tend to buy within their first days after registering and few come back later, as we saw earlier in this project.
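Since most of the value arrives in month 0, the figure usually compared against CAC is the cumulative LTV, i.e. the running total of the monthly values across each cohort's row. A minimal sketch of that running total on hypothetical cohort numbers (not the project data):

```python
import pandas as pd

# Hypothetical monthly LTV per cohort (rows: cohort, columns: age in months)
ltv = pd.DataFrame(
    [[2.36, 0.24, 0.22],
     [3.01, 0.17, None]],
    index=['2017-06', '2017-07'],
    columns=[0, 1, 2],
)

# Running total across ages = cumulative value of a buyer in each cohort
ltv_cum = ltv.cumsum(axis=1)
print(ltv_cum.loc['2017-06', 2])  # 2.36 + 0.24 + 0.22 = 2.82 (up to float rounding)
```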

Marketing

How much money was spent? Overall/per source/over time

In [85]:
overall_cost = costs['costs'].sum()
print('Overall costs are: {}'.format(overall_cost))
Overall costs are: 329131.62

I will check costs per ad source.

In [86]:
costs_grouped_source = costs.groupby('source_id').agg({'costs':'sum'}).reset_index()

fig,ax=plt.subplots(figsize=(6,4))
ax.vlines(x=costs_grouped_source.source_id, ymin=0,ymax=costs_grouped_source.costs, color='red',
          alpha=0.7, linewidth=10)
ax.set_title('Costs By Ad source',fontdict={'size':15})
ax.set_xlabel('Ad Source')
ax.set_ylabel('Cost')

plt.grid()
plt.show()

This graph shows the cost per ad source. Source 3 takes up almost half of the entire budget. Sources 6, 7, and 8 are missing, so we are not paying for them, which explains why so few of our users come from them. It also explains why sources 9 and 10 bring few users: their cost is a small fraction of what the other sources get.

Now I will check the cost per month.

In [87]:
costs['month']=costs['dt'].dt.month
In [88]:
costs_grouped_source
Out[88]:
source_id costs
0 1 20833.27
1 10 5822.49
2 2 42806.04
3 3 141321.63
4 4 61073.60
5 5 51757.10
6 9 5517.49
In [89]:
costs_grouped_month = costs.groupby('month').agg({'costs':'sum'}).reset_index()

fig,ax=plt.subplots(figsize=(6,4))
ax.vlines(x=costs_grouped_month.month, ymin=0,ymax=costs_grouped_month.costs, color='red',
          alpha=0.7, linewidth=10)
ax.set_title('Costs By Month',fontdict={'size':15})
ax.set_xlabel('Month')
ax.set_ylabel('Cost')

plt.grid()
plt.show()

There are two big pushes a year: one in April and the other just before Christmas. There are also three months with almost no ad spend: February, July, and September. Although there is a push (a high ad cost) around April, the graphs we plotted earlier show the opposite effect: sales drop after April.

How much did customer acquisition from each of the sources cost?

I will calculate the CAC for each of the sources.

In [90]:
last_visits = visits.sort_values(by='start_ts', ascending=True
                                ).groupby('uid', as_index=False).agg({'source_id':'last'})

last_visits_grouped = last_visits.groupby('source_id').agg({'uid':'count'}).reset_index()

cost_grouped= pd.merge(last_visits_grouped,costs_grouped_source,on='source_id',)
In [91]:
cost_grouped['cac']= cost_grouped['costs']/ cost_grouped['uid']
cost_grouped
Out[91]:
source_id uid costs cac
0 1 12974 20833.27 1.605771
1 10 6644 5822.49 0.876353
2 2 20427 42806.04 2.095562
3 3 60386 141321.63 2.340305
4 4 71958 61073.60 0.848740
5 5 48591 51757.10 1.065158
6 9 7155 5517.49 0.771138
In [92]:
fig,ax=plt.subplots(figsize=(6,4))
ax.vlines(x=cost_grouped.source_id, ymin=0,ymax=cost_grouped.cac, color='red',
          alpha=0.7, linewidth=10)
ax.set_title('CAC by Ad Source',fontdict={'size':15})
ax.set_xlabel('Ad Source')
ax.set_ylabel('CAC')

plt.grid()
plt.show()

This graph shows the customer acquisition cost (CAC) per ad source; several ad sources that we did not pay for are not in the graph. The most cost-effective sources were 10, 4, and 9, and the priciest were 2 and 3. Source 2 was among the priciest, but as we saw in an earlier graph it brought us the most revenue, so it was worthwhile. Source 10, on the other hand, did not bring as much revenue, so I think this ad source should be reconsidered.

How worthwhile were the investments?

I'll start by calculating the ROI for each cohort. I have the 'report' DataFrame and will build on it.

In [93]:
costs['dt'].describe()
Out[93]:
count                    2542
unique                    364
top       2018-03-03 00:00:00
freq                        7
first     2017-06-01 00:00:00
last      2018-05-31 00:00:00
Name: dt, dtype: object
In [94]:
orders['buy_ts'].describe()
Out[94]:
count                   50415
unique                  45991
top       2018-05-31 10:13:00
freq                        9
first     2017-06-01 00:10:00
last      2018-06-01 00:02:00
Name: buy_ts, dtype: object
In [95]:
costs['cost_month']= costs['dt'].astype('datetime64[M]')

I can see that the costs DataFrame ends one month earlier than the orders DataFrame, so we won't have cost data for the last month of orders.
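One way to handle that mismatch (a sketch with hypothetical dates, not what the notebook does below) is to drop orders falling outside the period covered by cost data:

```python
import pandas as pd

# Hypothetical frames: costs cover Jun-Jul, orders run one month longer
costs_toy  = pd.DataFrame({'dt':     pd.to_datetime(['2017-06-10', '2017-07-05'])})
orders_toy = pd.DataFrame({'buy_ts': pd.to_datetime(['2017-06-02', '2017-07-20', '2017-08-01'])})

# Keep only orders inside the last month that has cost data
last_cost_month = costs_toy['dt'].max().to_period('M')
in_window = orders_toy['buy_ts'].dt.to_period('M') <= last_cost_month
print(in_window.tolist())  # [True, True, False]
```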

In [96]:
visits_ad_source = visits_big.groupby('uid').agg({'uid':'last','source_id':'last'})
visits_ad_source.columns =['uid','uid_ad_source']
visits_ad_source.head()
Out[96]:
uid uid_ad_source
uid
11863502262781 11863502262781 3
49537067089222 49537067089222 2
297729379853735 297729379853735 3
313578113262317 313578113262317 2
325320750514679 325320750514679 5
In [97]:
visits_ad_source.reset_index(drop=True,inplace=True)
visits_ad_source.head()
Out[97]:
uid uid_ad_source
0 11863502262781 3
1 49537067089222 2
2 297729379853735 3
3 313578113262317 2
4 325320750514679 5
In [98]:
orders_ad_source = pd.merge(visits_ad_source,orders,on='uid')
orders_ad_source.head()
Out[98]:
uid uid_ad_source buy_ts revenue order_date order_week order_month order_month_datetime
0 313578113262317 2 2018-01-03 21:51:00 0.55 2018-01-03 1 1 2018-01-01
1 1575281904278712 10 2017-06-03 10:13:00 1.22 2017-06-03 22 6 2017-06-01
2 1575281904278712 10 2017-06-03 17:39:00 1.83 2017-06-03 22 6 2017-06-01
3 2429014661409475 3 2017-10-11 18:33:00 73.33 2017-10-11 41 10 2017-10-01
4 2464366381792757 5 2018-01-28 15:54:00 2.44 2018-01-28 4 1 2018-01-01
In [99]:
ad_cohort_size = orders_ad_source.groupby('uid_ad_source').agg({'uid':'nunique'}).reset_index()
ad_cohort_size.columns = ['uid_ad_source','n_users']
ad_cohort_size
Out[99]:
uid_ad_source n_users
0 1 3831
1 10 1230
2 2 4239
3 3 9158
4 4 9894
5 5 6753
6 7 1
7 9 1417
In [100]:
ad_cohorts = orders_ad_source.groupby(['uid_ad_source','order_month_datetime']).agg({'revenue':'sum'}).reset_index()
report_ad = pd.merge(ad_cohorts,ad_cohort_size,on='uid_ad_source')
report_ad.head()
Out[100]:
uid_ad_source order_month_datetime revenue n_users
0 1 2017-06-01 1438.70 3831
1 1 2017-07-01 1589.99 3831
2 1 2017-08-01 1403.99 3831
3 1 2017-09-01 2794.27 3831
4 1 2017-10-01 4322.47 3831
In [101]:
report_ad['gp']= report_ad['revenue']*margin_rate
report_ad['ltv'] = report_ad['gp'] / report_ad['n_users']
report_ad.rename(columns={'uid_ad_source':'source_id'},inplace=True)
report_ad.head()
Out[101]:
source_id order_month_datetime revenue n_users gp ltv
0 1 2017-06-01 1438.70 3831 719.350 0.187771
1 1 2017-07-01 1589.99 3831 794.995 0.207516
2 1 2017-08-01 1403.99 3831 701.995 0.183241
3 1 2017-09-01 2794.27 3831 1397.135 0.364692
4 1 2017-10-01 4322.47 3831 2161.235 0.564144
In [102]:
monthly_ad_costs = costs.groupby(['source_id','cost_month']).sum().reset_index()
monthly_ad_costs.rename(columns={'cost_month':'order_month_datetime'},inplace=True)
monthly_ad_costs.head()
Out[102]:
source_id order_month_datetime costs month
0 1 2017-06-01 1125.61 180
1 1 2017-07-01 1072.88 217
2 1 2017-08-01 951.81 248
3 1 2017-09-01 1502.01 270
4 1 2017-10-01 2315.75 310
In [103]:
report_ad_ = pd.merge(report_ad,monthly_ad_costs,on=['source_id','order_month_datetime'],how='outer')
report_ad_.head()
Out[103]:
source_id order_month_datetime revenue n_users gp ltv costs month
0 1 2017-06-01 1438.70 3831 719.350 0.187771 1125.61 180.0
1 1 2017-07-01 1589.99 3831 794.995 0.207516 1072.88 217.0
2 1 2017-08-01 1403.99 3831 701.995 0.183241 951.81 248.0
3 1 2017-09-01 2794.27 3831 1397.135 0.364692 1502.01 270.0
4 1 2017-10-01 4322.47 3831 2161.235 0.564144 2315.75 310.0
In [104]:
report_ad_['cac']= report_ad_['costs']/report_ad_['n_users']
report_ad_['romi']= report_ad_['ltv']/report_ad_['cac']
report_ad_['just_date'] = report_ad_['order_month_datetime'].dt.date
report_ad_.head()
Out[104]:
source_id order_month_datetime revenue n_users gp ltv costs month cac romi just_date
0 1 2017-06-01 1438.70 3831 719.350 0.187771 1125.61 180.0 0.293816 0.639076 2017-06-01
1 1 2017-07-01 1589.99 3831 794.995 0.207516 1072.88 217.0 0.280052 0.740992 2017-07-01
2 1 2017-08-01 1403.99 3831 701.995 0.183241 951.81 248.0 0.248449 0.737537 2017-08-01
3 1 2017-09-01 2794.27 3831 1397.135 0.364692 1502.01 270.0 0.392067 0.930177 2017-09-01
4 1 2017-10-01 4322.47 3831 2161.235 0.564144 2315.75 310.0 0.604477 0.933276 2017-10-01
In [105]:
output_romi = report_ad_.pivot_table(index='source_id',columns='just_date',values='romi',aggfunc='mean')
output_romi_cumsum = output_romi.cumsum(axis=1).round(2)
output_romi_cumsum
Out[105]:
just_date 2017-06-01 2017-07-01 2017-08-01 2017-09-01 2017-10-01 2017-11-01 2017-12-01 2018-01-01 2018-02-01 2018-03-01 2018-04-01 2018-05-01
source_id
1 0.64 1.38 2.12 3.05 3.98 5.14 6.06 6.76 7.53 8.92 10.17 11.55
10 0.32 0.50 0.88 1.08 1.49 1.77 2.00 2.26 2.60 3.14 3.51 4.10
2 0.29 0.84 1.14 1.75 2.43 2.93 4.42 4.81 5.62 6.67 7.33 8.32
3 0.17 0.34 0.47 0.65 0.80 0.95 1.12 1.25 1.40 1.58 1.71 1.90
4 0.25 0.67 0.97 1.32 1.76 2.30 2.69 3.10 3.55 4.01 4.43 4.87
5 0.35 0.75 1.23 1.76 2.24 2.54 2.95 3.29 3.74 4.17 4.56 5.01
9 0.55 1.12 1.76 2.34 3.03 3.72 4.26 4.79 5.38 6.19 6.59 6.99
In [106]:
sns.set(style='white')
plt.figure(figsize=(15, 9))


sns.heatmap(output_romi_cumsum, annot=True,  linewidths=1, linecolor='gray',cmap='RdBu_r')


plt.title('Ad source: Romi', fontdict={'size':15})
plt.xlabel("Month")
plt.ylabel("Ad Source")
plt.xticks(rotation=40)
plt.show()
In [107]:
output_romi_cumsum.head()
Out[107]:
just_date 2017-06-01 2017-07-01 2017-08-01 2017-09-01 2017-10-01 2017-11-01 2017-12-01 2018-01-01 2018-02-01 2018-03-01 2018-04-01 2018-05-01
source_id
1 0.64 1.38 2.12 3.05 3.98 5.14 6.06 6.76 7.53 8.92 10.17 11.55
10 0.32 0.50 0.88 1.08 1.49 1.77 2.00 2.26 2.60 3.14 3.51 4.10
2 0.29 0.84 1.14 1.75 2.43 2.93 4.42 4.81 5.62 6.67 7.33 8.32
3 0.17 0.34 0.47 0.65 0.80 0.95 1.12 1.25 1.40 1.58 1.71 1.90
4 0.25 0.67 0.97 1.32 1.76 2.30 2.69 3.10 3.55 4.01 4.43 4.87

We can see in the table that by the fourth month most of the ad sources had started to pay off. Source 3 was the last to pay off, crossing a ROMI of 1 only in its seventh month. The fastest and most profitable source according to this table is number 1: it became profitable in its second month and, by the last data point we have, returns more than 11 times its cost.
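The payoff month read off the heatmap can also be computed directly as the first month where cumulative ROMI reaches 1. A minimal sketch on hypothetical values for one source:

```python
import pandas as pd

# Hypothetical cumulative ROMI for one ad source across months 1..5
romi_cum = pd.Series([0.17, 0.34, 0.65, 0.95, 1.12], index=[1, 2, 3, 4, 5])

# First month where the cumulative return covers the spend (ROMI >= 1)
payback_month = romi_cum[romi_cum >= 1].index.min()
print(payback_month)  # 5
```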

Step 3. Conclusion

In this project we looked at visit, order, and ad-cost data for Yandex.Afisha.

We saw that the monthly, weekly, and daily numbers of users and sessions change mainly with the seasons (holidays etc.). Overall, most users log in from desktop devices, far more than from touch devices, and desktop sessions are almost twice as long.

When we looked at conversion rates, we calculated that most users who convert do so on the first day (almost 69%); otherwise they convert either within the first week (11%) or only after the first month (13%). They are unlikely to convert later in the first month if it didn't happen within the first week. By cohort, the first-week conversion rate rose with each cohort and peaked in the last one, the 2018-05 cohort.

When it comes to marketing, first things first: something has to change with source 3. It is the most expensive, it was the last to become profitable in the ROI analysis, and the users it brings are not the best spenders. I would recommend cutting spending on this ad source by a lot. Second, sources 1 and 2 should get a larger share of the overall marketing budget: they are the fastest to become profitable and they bring users who, on average, spend more than users from other sources, so raising their budget will increase the company's profits. Source 9 is another one I would recommend giving more of the budget to. It is currently one of the cheapest sources, and accordingly it is not high in the number of users it brings, nor do those users spend a lot, but its ROI is the third highest.

One more thing I would look into is the abnormal dip in sessions and active users around the beginning of April: the user-activity graphs showed a really hard dip there, and perhaps a technical problem on the site stopped users from logging in. This brings us to the monthly ad budget. Costs rise around the end of the year, and that rise corresponds to a very large rise in total sales. For some reason there is also a sharp rise in ad costs in April, but it does not correspond to any rise in sales around that time; in fact, we see a general fall in sales then. So I would recommend lowering the ad spend in April, since it brings little benefit while being one of the most expensive months.